ABSTRACT We have developed three computer programs for
comparisons of protein and DNA sequences. They can be used to
search sequence data bases, evaluate similarity scores, and identify
periodic structures based on local sequence similarity. The FASTA
program is a more sensitive derivative of the FASTP program, which
can be used to search protein or DNA sequence data bases and can
compare a protein sequence to a DNA sequence data base by translating
the DNA data base as it is searched. FASTA includes an additional
step in the calculation of the initial pairwise similarity score that
allows multiple regions of similarity to be joined to increase the
score of related sequences. The RDF2 program can be used to evaluate
the significance of similarity scores using a shuffling method that
preserves local sequence composition. The LFASTA program can display
all the regions of local similarity between two sequences with scores
greater than a threshold, using the same scoring parameters and a
similar alignment algorithm; these local similarities can be displayed
as a "graphic matrix" plot or as individual alignments. In addition,
these programs have been generalized to allow comparison of DNA or
protein sequences based on a variety of alternative scoring matrices.



We have been developing tools for the analysis of protein 
and DNA sequence similarity that achieve a balance of 
sensitivity and selectivity on the one hand and speed and 
memory requirements on the other. Three years ago, we 
described the FASTP program for searching amino acid 
sequence data bases (1), which uses a rapid technique for 
finding identities shared between two sequences and exploits 
the biological constraints on molecular evolution. FASTP
has decreased the time required to search the National 
Biomedical Research Foundation (NBRF) protein sequence 
data base by more than two orders of magnitude and has 
been used by many investigators to find biologically signifi- 
cant similarities to newly sequenced proteins. There is a 
trade-off between sensitivity and selectivity in biological 
sequence comparison: methods that can detect more dis- 
tantly related sequences (increased sensitivity) frequently 
increase the similarity scores of unrelated sequences (de- 
creased selectivity). In this paper we describe a new version 
of FASTP, FASTA, which uses an improved algorithm that 
increases sensitivity with a small loss of selectivity and a 
negligible decrease in speed. We have also developed a 
related program, LFASTA, for local similarity analyses of 
DNA or amino acid sequences. These programs run on 
commonly available microcomputers as well as on larger 
machines. 

METHODS 

The search algorithm we have developed proceeds through
four steps in determining a score for pairwise similarity.
FASTP and FASTA achieve much of their speed and selectivity
in the first step, by using a lookup table to locate all
identities or groups of identities between two DNA or amino
acid sequences (2).
The ktup parameter determines how many consecutive iden- 
tities are required in a match. For example, if ktup = 4 for a 
DNA sequence comparison, only those identities that occur 
in a run of four consecutive matches are examined. In the 
first step, the 10 best diagonal regions are found using a 
simple formula based on the number of ktup matches and the 
distance between the matches without considering shorter 
runs of identities, conservative replacements, insertions, or 
deletions (1, 3). 
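
The first step can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the function names are hypothetical, and diagonals are ranked simply by their number of ktup hits, whereas the formula referred to above also takes the distance between matches into account.

```python
from collections import defaultdict

def ktup_diagonal_hits(query, library, ktup):
    """Count ktup-word identities on each diagonal (i - j) of the comparison."""
    # Lookup table: every ktup-length word in the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    # Scan the library sequence once; each shared word adds a hit to diagonal i - j.
    hits = defaultdict(int)
    for j in range(len(library) - ktup + 1):
        for i in table.get(library[j:j + ktup], ()):
            hits[i - j] += 1
    return dict(hits)

def best_diagonals(hits, n=10):
    """Return up to n diagonals with the most hits (a simplified ranking, not the paper's formula)."""
    return sorted(hits, key=lambda d: (-hits[d], d))[:n]
```

Because the library sequence is scanned only once against a precomputed table, the cost of this step grows with the number of shared words rather than with the product of the two sequence lengths.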

In the second step of the comparison, we rescore these 10 
regions using a scoring matrix that allows conservative 
replacements and runs of identities shorter than ktup to 
contribute to the similarity score. For protein sequences, 
this score is usually calculated using the PAM250 matrix (4), 
although scoring matrices based on the minimum number of 
base changes required for a replacement or on an alternative 
measure of similarity can also be used with FASTA. For 
each of these best diagonal regions, a subregion with maximal
score is identified. We will refer to this region as the
"initial region"; the best initial regions from Fig. 1A are
shown in Fig. 1B.
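
Finding the maximal-scoring subregion of one diagonal can be sketched as a running-sum scan. This is a minimal illustration, not the authors' code: `best_subregion` and the `score` callback are hypothetical names, and any scoring function (for example a PAM250 lookup) could be passed in.

```python
def best_subregion(query, library, diagonal, score):
    """Return (best_score, start_i, end_i) of the maximal-scoring ungapped run
    on one diagonal, where diagonal = i - j and end_i is exclusive."""
    i = max(diagonal, 0)
    j = i - diagonal
    best = (0, i, i)
    running, start = 0, i
    while i < len(query) and j < len(library):
        if running <= 0:          # restart the run once the score has dropped to zero
            running, start = 0, i
        running += score(query[i], library[j])
        if running > best[0]:
            best = (running, start, i + 1)
        i += 1
        j += 1
    return best
```

With a matrix that gives positive scores to conservative replacements, runs of identities shorter than ktup can extend a region instead of ending it.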

The FASTP program uses the single best scoring initial 
region to characterize pairwise similarity; the initial scores
are used to rank the library sequences. FASTA goes one
step further during a library search; in a third step, it checks to see whether
several initial regions may be joined together. Given the 
locations of the initial regions, their respective scores, and a 
"joining" penalty (analogous to a gap penalty), FASTA 
calculates an optimal alignment of initial regions as a com- 
bination of compatible regions with maximal score. FASTA 
uses the resulting score to rank the library sequences. We 
limit the degradation of selectivity by including in the 
optimization step only those initial regions whose scores are 
above a threshold. This process can be seen by comparing
Fig. 1B with Fig. 1C. Fig. 1B shows the 10 highest scoring
initial regions after rescoring with the PAM250 matrix; the
best initial region reported by FASTP is marked with an
asterisk. Fig. 1C shows an optimal subset of initial regions
that can be joined to form a single alignment.
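
The joining of initial regions can be sketched as a small chaining calculation. The sketch below rests on assumptions that are not taken from the program itself: regions are represented as hypothetical tuples with exclusive end coordinates, two regions are treated as compatible when one ends before the other begins in both sequences, and a single fixed penalty is charged per join.

```python
def join_regions(regions, join_penalty, threshold):
    """Best total score from chaining compatible regions.

    Each region is (q_start, q_end, l_start, l_end, score) with exclusive ends.
    """
    # Only regions scoring above the threshold take part in the optimization.
    kept = sorted(r for r in regions if r[4] > threshold)
    best = []
    for k, (qs, qe, ls, le, sc) in enumerate(kept):
        total = sc
        for a in range(k):
            aqs, aqe, als, ale, _ = kept[a]
            # Region a can precede region k only if it ends before k starts
            # in both the query and the library sequence.
            if aqe <= qs and ale <= ls:
                total = max(total, best[a] + sc - join_penalty)
        best.append(total)
    return max(best, default=0)
```

When no join pays for its penalty, the result falls back to the single best region, which corresponds to the score FASTP would use.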

Fig. 1. Identification of sequence similarities by FASTA. The
four steps used by the FASTA program to calculate the initial and
optimal similarity scores between two sequences are shown. (A)
Identify regions of identity. (B) Scan the regions using a scoring
matrix and save the best initial regions. Initial regions with scores
less than the joining threshold (27) are dashed. The asterisk denotes
the highest scoring region reported by FASTP. (C) Optimally join
initial regions with scores greater than a threshold. The solid lines
denote regions that are joined to make up the optimized initial score.
(D) Recalculate an optimized alignment centered around the highest
scoring initial region. The dotted lines denote the bounds of the
optimized alignment. The result of this alignment is reported as the
optimized score.

In the fourth step of the comparison, the highest scoring
library sequences are aligned using a modification of the
optimization method described by Needleman and Wunsch
(5) and Smith and Waterman (6). This final comparison
considers all possible alignments of the query and library
sequence that fall within a band centered around the highest
scoring initial region (Fig. 1D). With the FASTP program,
optimization frequently improved the similarity scores of
related sequences by factors of 2 or 3. Because FASTA
calculates an initial similarity score based on an optimization
of initial regions during the library search, the initial score is
much closer to the optimized score for many sequences. In
fact, unlike FASTP, the FASTA method may yield initial
scores that are higher than the corresponding optimized
scores.
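
A banded optimization of this kind can be sketched as follows. This is a simplified stand-in rather than the modified method used by the programs: it computes a Smith-Waterman style local score, uses a single linear gap penalty instead of separate penalties for the first and subsequent residues in a gap, and all names and parameters are hypothetical.

```python
def banded_local_score(query, library, diagonal, half_width, score, gap):
    """Local alignment score restricted to cells with
    |(i - j) - diagonal| <= half_width, using a linear gap penalty."""
    n, m = len(query), len(library)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        # Columns j whose diagonal (i - j) lies inside the band.
        lo = max(1, i - diagonal - half_width)
        hi = min(m, i - diagonal + half_width)
        for j in range(lo, hi + 1):
            cur[j] = max(
                0,
                prev[j - 1] + score(query[i - 1], library[j - 1]),
                prev[j] - gap,      # gap in the library sequence
                cur[j - 1] - gap,   # gap in the query sequence
            )
            best = max(best, cur[j])
        prev = cur
    return best
```

Restricting the calculation to a band keeps the work proportional to the band width times the sequence length. It also shows why an initial score built from joined regions on distant diagonals can exceed the optimized score: the band around a single region may not contain the other regions.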

Local Similarity Analyses. Molecular biologists are often 
interested in the detection of similar subsequences within 
longer sequences. In contrast to FASTP and FASTA, which 
report only the one highest scoring alignment between two 
sequences, local sequence comparison tools can identify 
multiple alignments between smaller portions of two se- 
quences. Local similarity searches can clearly show the 
results of gene duplications (see Fig. 2) or repeated struc- 
tural features (see Fig. 3) and are frequently displayed using 
a "graphic matrix" plot (7), which allows one to detect 
regions of local similarity by eye. Optimal algorithms for 
sensitive local sequence comparison (6, 8, 9) can have 
tremendous computational requirements in time and mem- 
ory, which make them impractical on microcomputers and, 
when comparing longer sequences, on larger machines as 
well. 

The program for detecting local similarities, LFASTA, 
uses the same first two steps for finding initial regions that 
FASTA uses. However, instead of saving 10 initial regions, 
LFASTA saves all diagonal regions with similarity scores 
greater than a threshold. LFASTA and FASTA also differ in 
the construction of optimized alignments. Instead of focus- 
ing on a single region, LFASTA computes a local alignment 
for each initial region. Thus LFASTA considers all of the 
initial regions shown in Fig. 1B, instead of just the diagonal
shown in Fig. 1D. Furthermore, LFASTA considers not
only the band around each initial region but also potential
sequence alignments for some distance before and after the 
initial region. Starting at the end of the initial region, an 
optimization (6) proceeds in the reverse direction until all 
possible alignment scores have gone to zero. The location of 
the maximal local similarity score in the reverse direction is 
then used to start a second optimization that proceeds in the 
forward direction. An optimal path starting from the forward 
maximum is then displayed (5). The local homologies can be 
displayed as sequence alignments (see Fig. 2B) or on a 
two-dimensional graphic matrix style plot (see Figs. 2A and 
3). 

Statistical Significance. The rapid sequence comparison 
algorithms we have developed also provide additional tools 
for evaluating the statistical significance of an alignment. 
There are approximately 5000 protein sequences, with 1.1 
million amino acid residues, in the NBRF protein sequence 
library, and any computer program that searches the library 
by calculating a similarity score for each sequence in the 
library will find a highest scoring sequence, regardless of
whether the alignment between the query and library se- 
quence is biologically meaningful or not. Accompanying the 
previous version of FASTP was a program for the evaluation 
of statistical significance, RDF, which compares one se- 
quence with randomly permuted versions of the potentially 
related sequence. 

We have written a new version of RDF (RDF2) that has 
several improvements. (i) RDF2 calculates three scores for
each shuffled sequence: one from the best single initial region
(as found by FASTP), a second from the joined initial regions
(used by FASTA), and a third from the optimized diagonal.
(ii) RDF2 can be used to evaluate amino acid or DNA
sequences and allows the user to specify the scoring matrix to
be employed. Thus sequences found using the PAM250
scoring matrix can be evaluated using the identity or genetic
code matrix. (iii) The user may specify either a global or local
shuffle routine. 

Locally biased amino acid or nucleotide composition is 
perhaps the most common reason for high similarity scores 
of dubious biological significance (10). High scoring align- 
ments between query and library sequences may be due to 
patches of hydrophobic or charged amino acid residues or to 
A+T- or G+C-rich regions in DNA. A simple Monte Carlo
shuffle analysis that constructs random sequences by taking 
each residue in one sequence and placing it randomly along 
the length of the new sequence will break up these patches of 
biased composition. As a result, the scores of the shuffled 
sequences may be much lower than those of the unshuffled 
sequence, and the sequences will appear to be related. 
Alternatively, shuffled sequences can be constructed by 
permuting small blocks of 10 or 20 residues so that, while the 
order of the sequence is destroyed, the local composition is 
not. By shuffling the residues within short blocks along the 
sequence, patches of G+C- or A+T-rich regions in DNA,
for example, are undisturbed. Evaluating significance with a 
local shuffle is more stringent than the global approach, and 
there may be some circumstances in which both should be 
used in conjunction. Whereas two proteins that share a 
common evolutionary ancestor may have clearly significant 
similarity scores using either shuffling strategy, proteins 
related because of secondary structure or hydropathic pro- 
file may have similarity scores whose significance decreases 
dramatically when the results of global and local shuffling 
are compared. 
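
The difference between the two shuffling strategies can be made concrete with a short sketch. The function names and the use of Python's `random.Random` are assumptions made for illustration; they do not describe how RDF2 is implemented.

```python
import random

def global_shuffle(seq, rng):
    """Permute every residue across the whole sequence."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

def local_shuffle(seq, window, rng):
    """Permute residues only within consecutive blocks of `window` residues,
    so that each block keeps its own composition."""
    out = []
    for start in range(0, len(seq), window):
        block = list(seq[start:start + window])
        rng.shuffle(block)
        out.extend(block)
    return "".join(out)
```

For a sequence made of an A-rich block followed by a G-rich block, `local_shuffle` with a window of 10 leaves each patch in place, while `global_shuffle` spreads both residue types along the whole length.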

Implementation. The FASTA/LFASTA package of se- 
quence analysis tools is written in the C programming lan- 
guage and has been implemented under the Unix, VAX/ 
VMS, and IBM PC DOS operating systems. Versions of the 
program that run on the IBM PC are limited to query
sequences of 2000 residues; library sequences can be any
length. Copies of the program are available from the authors.

Table 1. FASTA and FASTP initial scores of the T-cell receptor
(RWMSAV) versus the NBRF data base

NBRF code   Sequence                              FASTA   FASTP
RWHUAV      T-cell receptor α chain                 155      98
KIHURE      Ig κ chain V-I region                   127     111
KVMS50      Ig κ chain V region                     149      62
KVMSM6      Ig κ chain precursor V regions          141      64
KVRB29      Ig κ chain V region                     126      54
L3HUSH      Ig λ chain V-III region                  90      47
KVMS41      Ig κ chain precursor V region            87      87
RWMSBV      T-cell receptor β-chain precursor        94      94
RWHUVY      T-cell receptor β-chain precursor        91      59
RWHUGV      T-cell receptor γ-chain precursor        87      61
RWHUT4      T-cell surface glycoprotein T4           86      63
RWMSVB      T-cell receptor γ-chain precursor        71      41
HVMS44      Ig heavy-chain V region                  67      36
GIHUDW      Ig heavy-chain V-II region               62      35

The average FASTP score = 26.1 ± 6.8 (mean ± SD). The
average FASTA score = 26.2 ± 7.2 (mean ± SD). The mean and
SD were computed excluding scores >54. V, Variable.

Although FASTA and LFASTA were designed for protein 
and DNA sequence comparison, they use a general method 
that can be applied to any alphabet with arbitrary match/ 
mismatch scoring values. All the scoring parameters, includ- 
ing match/mismatch values, values for the first residue in a 
gap and subsequent residues in the gap, and other parame- 
ters that control the number of sequences to be saved and 
the histogram intervals, can be specified without changing 
the program. 

EXAMPLES 

Comparison of FASTA with FASTP. To demonstrate the 
superiority of the FASTA method for computing the initial 
score, we compared the protein sequence of a T-cell receptor 
a chain (NBRF code RWMSAV) with all sequences in the 
NBRF protein data base‡ and computed initial scores with
both the present and previous methods. The T-cell receptor is 
a member of the immunoglobulin superfamily; in Release 12.0 
of the data base, this superfamily has 203 members. FASTP 
placed 160 immunoglobulin superfamily sequences in the 200 
top-scoring sequences; 57 related sequences received initial 
scores less than four standard deviations above the mean 
score. FASTA placed 180 superfamily members in the 200 
top-scoring sequences; only 20 related sequences scored 
below four standard deviations above the mean. Table 1 con- 
tains specific examples from this data base search. Although 
there is often little difference in the two methods, this ex- 
ample shows that in a number of cases the new method ob- 
tains significantly higher scores between related sequences. 

Nucleic Acid Data Base Search. FASTA can also be used to
search DNA sequence data bases, either by comparing a
DNA query sequence to the DNA library or by comparing an
amino acid query sequence to the DNA library by translating
each library DNA sequence in all six possible reading
frames. We compared the 660-nucleotide rat transforming
growth factor type α mRNA (GenBank locus RATTGFA)
with all the mammalian sequences in Release 48 of GenBank§.
We set ktup = 4 (see Methods), and the search was
completed in under 15 min on an IBM PC AT microcomputer.
The 10 top-scoring library sequences are shown in
Table 2. Although it can be seen that the 3 top-scoring
sequences are clearly related to RATTGFA, there are other
high-scoring sequences that are probably not related, and the
mouse epidermal growth factor, found in the translated data
base search (Table 3), is not found among the top-scoring
sequences.

Table 2. DNA data base search of rat transforming growth factor
(RATTGFA) versus mammalian sequences

GenBank locus   Sequence                          Initial score   Optimized score
HUMTGFAM        Human TGF mRNA                             1336              1618
HUMTGFA2        Human TGF gene (exon 2)                     354               366
HUMTGFA1        Human TGF gene (5' end)                     224               381
MUSRGEB3        Mouse 18S-5.8S-28S rRNA gene                140               107
MUSRGE52        Mouse 18S-5.8S-28S rRNA gene                140               107
MUSMHDD         MHC class I H-2D                            122                78
HUMMETIF1       Metallothionein (MT)If gene                 116                92
MUSRGLP         45S rRNA (5' end)                           115                83
HUMPS2          pS2 mRNA                                    105               106
MUSCIAII        α-1 type I procollagen                       86                89

The 10 sequences having the highest initial scores are given. TGF,
transforming growth factor; MHC, major histocompatibility complex.

(Natl. Biomed. Res. Found., Washington, DC), Release 12.

§EMBL/GenBank Genetic Sequence Database (1987) (Intelligenet-

To further examine the similarity detected between RAT- 
TGFA and MUSRGEB3, a mouse rRNA gene cluster, we 
used the RDF2 program for Monte Carlo analysis of statis- 
tical significance (the window for local shuffling was set to 10 
bases). Of the 50 shuffled comparisons (data not shown), 1 
obtained an initial score greater than 140 (the observed initial 
score), and 9 shuffled sequences obtained optimized scores 
greater than 107 (the observed optimized score). Therefore, 
the similarity between RATTGFA and MUSRGEB3 is un- 
likely to be significant. 

Translated Nucleic Acid Data Base Search. When searching 
for sequences that encode proteins, amino acid sequence 
comparisons are substantially more sensitive than DNA se- 
quence comparisons because one can use scoring matrices 
like the PAM250 matrix that discriminate between conserva- 
tive and nonconservative substitutions. A variant of FASTA, 
TFASTA, can be used to compare a protein sequence to a 
DNA sequence library; it translates the DNA sequences into 
each of six possible reading frames "on-the-fly." TFASTA 
translates the DNA sequences from beginning to end; it 
includes both intron and exon sequences in the translated 
protein sequence; termination codons are translated into 
unknown (X) amino acids. Table 3 shows the results of a 
translating search of the mammalian sequences in the Gen- 
Bank DNA data base using the RATTGFA protein sequence 
as the query and ktup = 1. In the translated search, the mouse 
epidermal growth factor now obtains an initial score higher 
than any unrelated sequences; however, HUMTGFA1, which
was found in the DNA data base search but only contains 13 
translated codons, is no longer among the top scoring se- 
quences. 
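
The translation described above can be sketched in a few lines. This is an illustrative sketch only: the names are hypothetical, the standard genetic code is assumed, and, unlike an "on-the-fly" implementation, all six translations are built in memory at once.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {a + b + c: AMINO[16 * i + 4 * j + k]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)
         for k, c in enumerate(BASES)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna, frame):
    """Translate from offset `frame` (0, 1, or 2); termination and
    unrecognized codons are written as X."""
    protein = []
    for p in range(frame, len(dna) - 2, 3):
        aa = CODON.get(dna[p:p + 3], "X")
        protein.append("X" if aa == "*" else aa)
    return "".join(protein)

def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(s, f) for s in (dna, reverse) for f in range(3)]
```

Writing termination codons as X rather than ending the translation lets a comparison continue across stop codons in introns and other noncoding stretches.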

Table 3. Translated DNA data base search of rat transforming growth factor (RATTGFA) versus
mammalian sequences

GenBank locus   Sequence                                Frame   Initial score   Optimized score
RATTGFA         Rat TGF type α                          1                 816               816
HUMTGFAM        Human TGF mRNA
HUMTGFA2        Human TGF gene                          1                 204               205
MUSEGF          Mouse EGF mRNA                          3                  93               129
MUSMHAB3        Mouse MHC class II H2-IAβ               1                  91                58
MUSIGCD17       Mouse Ig germ-line DJC region           3'                 85                48
HUMESTR         Human estrogen receptor                 3                  83                65
RATINSI         Rat insulin 1 (Ins-1) gene              2                  81                63
MUSTHYS1        Mouse thymidylate synthase              2                  80                63
HUMPNU3         Human purine nucleoside phosphorylase   1'                 80                52

The 10 sequences having the highest initial scores are given. TGF, transforming growth factor; EGF,
epidermal growth factor; D, diversity; J, joining; C, constant; MHC, major histocompatibility
complex.

Local Similarities. Fig. 2 displays the output of a local
similarity analysis (ktup = 4) of CHPHBAIM, a chimpanzee
α1-globin mRNA, and RABHBAPT, a rabbit α-globin gene,
including the complete coding sequence and a flanking
pseudo-θ1-globin gene. LFASTA can display either a graphic
matrix style plot of the local homologies (Fig. 2A) or the
alignments themselves (Fig. 2B). The right-most three
alignments (Fig. 2A) match the corresponding regions of the
mRNA to exon subsequences from the pseudogene. We note
that the FASTA initial score for the comparison of CHPHBAIM
and RABHBAPT would be based on the three globin
gene exons, while the FASTP initial score would be based on
a single conserved exon.

The Smith-Waterman optimization used in the LFASTA 
program allows the detection of more subtle features than 
can be detected by the eye using a graphic matrix plot, 
because the path traced is locally optimal, even though it 
may only have a slightly higher density of identities and 
conservative replacements. Fig. 3 shows a plot from a local 
similarity self-comparison of the myosin heavy chain from 
the nematode Caenorhabditis elegans (MWKW) using the 
PAM250 matrix. The amino-terminal half of the molecule 
forms a large globular head without any periodic structure; 
the solid line down the main diagonal represents the ex- 
pected identity of the sequence with itself. The symmetrical 
parallel lines along the carboxyl-terminal half of the mole- 
cule correspond to the 28-residue repeat responsible for the 
α-helical coiled-coil structure of the rod segment.

DISCUSSION 

In searching a data base, one is attempting to measure 
relatedness; in aligning two homologous sequences, one is
trying to choose the most likely set of mutations since their
divergence from a common ancestral sequence. Thus any 
tool for the analysis of sequence similarities must contain 
within it an implicit model of molecular evolution. An 
algorithm that guarantees the optimality of its alignments 
based on a set of scoring rules must be judged on how well 
these rules fit our current understanding of the process of 
molecular evolution. Algorithms that sacrifice realism to 
achieve greater efficiency, regardless of their mathematical 
rigor, require careful empirical evaluation. 

Even though the tools we have developed use rigorous 
algorithms at each step and incorporate a realistic model of 
evolution, their hierarchical nature makes them heuristic. The
original FASTP program has had the benefit of extensive use 
and evaluation by a wide variety of scientists. The FASTA 
program exploits refinements of the previous approach that 
result in a significant improvement in sensitivity. The LFASTA
local similarity analysis program is also a logical
extension of the FASTP approach.

B

10 20 30 40 50 60

CHPHBA GACTCAGAAAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTCAAGGCCG

RABHBA GACTGAGAAGGAA-CCACCATGGTGCTGTCTCCCGCTGACAAGACCAACATCAAGACTG
180 190 200 210 220

70 80 90 100 110

CHPHBA CCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGG

RABHBA CCTGGGAAAAGATCGGCAGCCACGGTGGCGAGTATGGCGCCGAGGCCGTGGAGAGG
230 240 250 260 270 280

Fig. 2. Local comparison of an α-globin mRNA sequence with an α-globin gene cluster. An ape α1-globin mRNA sequence (GenBank
sequence CHPHBAIM) was compared with a rabbit α-globin gene sequence (RABHBAPT) containing a second pseudo-θ-globin gene using the
LFASTA program. (A) A plot of the homologous regions shared by the two sequences. (B) One of the alignments between the mRNA sequence
and the rabbit α-globin gene (nucleotides 171-855). Three other alignments between the mRNA sequence and the α-globin gene and three
alignments between the pseudo-θ-globin gene (nucleotides 3200-3770) were calculated but are not shown. There is 84.3% identity in the 115
nucleotide overlap. The initial region and optimized scores using LFASTA are 284 and 304, respectively. X denotes the ends of the initial region
found by LFASTA.

Because of the trade-offs between sensitivity and selectivity
in data base searches, the results of any search, and
particularly those that result in alignment scores that are not
clearly separated from the distribution of all library sequence
scores, must be carefully evaluated (1, 11). The Monte Carlo
analysis of statistical significance provided by a program 
such as RDF2 can often be critical in evaluating a borderline 
similarity. Previously we suggested ranges of z values [(ob- 
served score - mean of shuffled scores)/standard deviation 
of shuffled scores] corresponding to approximate significance
levels. However, the z values determined in a Monte
Carlo analysis become less useful as the distribution of 
shuffled scores diverges from a normal distribution, as is 
found with FASTA. Therefore, we now focus on the highest 
scores of the shuffled sequences. For example, if in 50 
shuffled comparisons, several random scores are as high or 
higher than the observed score, then the observed similarity 
is not a particularly unlikely event. One can have more 
confidence if in 200 shuffled comparisons, no random score 
approaches the observed score. In general, our experience 
has led us to be conservative in evaluating an observed 
similarity in an unlikely biological context. 
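
Both ways of summarizing a shuffle analysis, the z value and the count of shuffled scores that reach the observed score, can be computed together. The sketch below is illustrative: the function name is hypothetical, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is intended.

```python
from statistics import mean, stdev

def shuffle_summary(observed, shuffled_scores):
    """Return (z, n_as_high) for an observed score against shuffled scores."""
    mu = mean(shuffled_scores)
    sd = stdev(shuffled_scores)          # sample standard deviation (assumed)
    z = (observed - mu) / sd
    # Number of shuffled comparisons scoring as high as or higher than the observed one.
    n_as_high = sum(s >= observed for s in shuffled_scores)
    return z, n_as_high
```

The count is the more robust of the two when the shuffled scores are not normally distributed, because it makes no assumption about the shape of the distribution.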

These programs provide a group of sequence analysis
tools that use a consistent measure for scoring similarity and
constructing alignments. FASTA, RDF2, and LFASTA all
use the same scoring matrices and similar alignment algorithms,
so that potentially related library sequences discovered
after the search of a sequence data base can be
evaluated further from a variety of perspectives. In addition,
LFASTA can also show alternative alignments between
sequences with periodic structures or duplications.

Fig. 3. Repeated structure in the myosin heavy chain. LFASTA was used
to compare the Caenorhabditis elegans myosin heavy chain protein sequence
(NBRF code MWKW) with itself using the PAM250 scoring matrix. The solid,
dashed, and dotted lines denote decreasing similarity scores. The solid lines had
initial region scores greater than 80 and optimized local scores greater than 150;
the longer dashed lines had initial region and optimized local scores greater than
65 and 120, respectively, and the shorter dashed lines had initial region and
optimized local scores greater than 50 and 100, respectively. Homologous regions
with lower scores are plotted with dots.